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''Module networks'' are a framework to learn gene regulatory networks from expression data using 
a probabilistic model in which coregulated genes share the same parameters and conditional distribu- 
tions. We present a method to infer ensembles of such networks and an averaging procedure to extract 
the statistically most significant modules and their regulators. We show that the inferred probabilistic 
models extend beyond the data set used to learn the models. 
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I. INTRODUCTION 



Methods for reverse-engineering transcription regu- 
latory networks from high-throughput microarray data 
come in many different flavors [1.2.3.4.5]. An impor- 
tant class of methods are those which not only seek to 
identify the topological wiring of the network f6^, 'SJ, 
but also attempt to infer a model of the biological sys- 
tem which explains the observed gene expression pat- 
terns and generates testable hypotheses. Such mod- 
els can take the form of probabilistic graphical mod- 
els yj |9| [lOJ, simplified kinetic equation models [11], 
or biophysical models [5J. A common property of all 
modeling approaches is that the number of parameters 
is much larger than the number of experimental data 
points available to define them. Dimensionality reduc- 
tion is usually achieved by a coarse-graining step which 
collapses individual genes into clusters of coexpressed 
genes or modules, where all genes in a cluster share the 
same model parameters [12J. This conceptual simplifi- 
cation has as a drawback that inferred interactions are 
influenced by the module quality. Moreover, it is hard 
to translate the concept of a biological module in a strict 
mathematical definition. When searching for modules, 
often many local optima exist with partially overlapping 
modules differing from each other in a few genes or 
conditions. Therefore, in our approach we exploit the 
"fuzzy'' property of a module to increase the reliability 
of the predicted interactions. Instead of reporting only 
one cluster solution (local optimum), we use a stochas- 
tic approach to generate many partially redundant clus- 
ter solutions (bootstrapping) and generate an ensemble 
solution by averaging over multiple high-scoring mod- 
els. Crucial for the success of the ensemble approach 
is the availability of an efficient method for sampling 
a large number of different models covering the whole 
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search space of possible models (TSj. Therefore we use 
the deterministic approach of [ifl as a basis for our sam- 
pling method. In this approach gene and condition clus- 
tering are decoupled from learning the regulatory pro- 
grams compared to the original method of [9J which op- 
timizes both simultaneously. This allows for a higher 
efficiency while maintaining an equal performance rate 
[14J. Several extensions of the method of [9] infer tran- 
scriptional modules from gene expression and genome 
wide location analyis |15|[l6l|l7l[l8l. It is entirely fea- 
sible to develop ensemble strategies for these methods 
as we did for |9j. This constitutes an interesting line of 
future research. In this paper we give a brief descrip- 
tion of the algorithm and highlight some results for an 
expression compendium for E. coli [8J. A complete anal- 
ysis of this data set and comparison of our approach to 
the mutual information based CLR method [8J is given 
in 1 19 1 . A detailed comparison between the ensemble 
approach and the direct-optimization based method of 
111 on S. cerevisiae data is given in |20| . 



11. RESULTS AND DISCUSSION 

A. The algorithm 

The algorithm takes as input a gene expression data 
set and a list of candidate regulators. It gives as output a 
large number of probabilistic models consisting of a set 
of gene clusters, with for each gene cluster a partition 
of the experiments and a probabilistic regulatory pro- 
gram explaining the observed experiment partitions in 
terms of the expression of a small number of regulators. 
The number of gene and experiment clusters is deter- 
mined automatically and can vary from one solution to 
the next. Using overrepresentation in the ensemble, the 
most probable interactions can be identified. The regu- 
latory programs can be validated on new experimental 
data and generate testable hypotheses about conditional 
regulation of the inferred gene clusters. 



2 




FIG. 1: Graphical cartoon representation of the Gibbs sampling 
procedure for two-way clustering of genes and conditions [21 J. 
Each colored curve represent a random restart converging on 
a distinct local optimum in the direction of genes. In the direc- 
tion of experiments the whole search space can be covered in 
one run. 



The first step of the algorithm consists of generating 
an ensemble of gene clusters with experiment partitions. 
A Gibbs sampling method iteratively updates the as- 
signment of each gene given the current gene and ex- 
periment clusters, and the assignment of each experi- 
ment in each gene cluster given the current assignment 
of all other experiments in that gene cluster, iterating un- 
til a stationary state is reached. Details about the Gibbs 
sampler algorithm and a complete analysis of its conver- 
gence properties can be found in [21 J. Briefly, in a single 
run, the Gibbs sampler reaches a local optimum in the 
direction of genes, but covers the w^hole search space 
in the direction of experiments. This implies that for 
a given cluster of coexpressed genes there are multiple 
equiprobable ways of partitioning the experiments. To 
also sample from the whole search space in the direction 
of genes, we perform several independent Gibbs sam- 
pler runs with random restarts. In 1211 , it is shown that 
each of the local optima in the direction of genes is (ap- 
proximately) of the same height, and therefore equally 
important, and that a relatively small number of local 
optima is sufficient to cover the whole search space (typ- 
ically two distinct sets of 10 local optima for a data set of 
1000 genes agree for 95% on the probability for each pair 
of genes to be clustered together, see [21 J for details). A 
graphical cartoon representation of the Gibbs sampling 
procedure is given in Figure [T] 

In the second step of the algorithm, regulatory pro- 
grams are learned for each experiment partition for each 
gene cluster. This is achieved by linking the sets in the 
experiment partition hierarchically in a decision tree. 



For each split in this tree a candidate regulator is found 
whose expression is significantly different on both sides 
of the split, as measured by an entropy measure. Details 
can be found in fT4J. 
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FIG. 2: Example module with a regulatory program. The num- 
bers under the experiment clusters are the standard deviation 
and mean expression value of the data in the cluster. The num- 
bers under each tree node are the normalized Bayesian score 
gained in the Gibbs sampler by making this data split |21 1, and 
a percentage quality score for the assignment of this regulator 
1141. The colored bars on each tree node are the expression lev- 
els of the regulators with experiments sorted in the same order 
as in the experiment clusters. 

A gene cluster with a regulatory program is called a 
transcriptional module and a partition of all genes into 
clusters, each with a regulatory program, a module net- 
work. A sample module is shown in Figure |2] To each 
module network corresponds a probabilistic model de- 
fined by a probability density function . . . , Xjv) 
with Xi the continuous valued expression level of gene i 
(see Methods). The value of . . . , measures how 
well the model explains a particular experiment with ex- 
pression levels for all genes. 



B. Ensemble averaging 

We have applied the algorithm on a compendium of 
Affymetrix microarrays for E. coli [8]. The compendium 
contains 445 expression profiles for 4345 genes under 
189 different stress conditions and genetic perturba- 
tions. The results are validated by comparison with ex- 
isting knowledge of transcriptional interactions in Reg- 
ulonDB [22 J and EcoCyc |23|. A complete analysis of 
the inferred module networks and their biological sig- 
nificance is beyond the scope of this paper and will be 
given elsewhere [19]. Here we highlight a few examples 
of inferred interactions which are representative for the 
working mechanism of the ensemble approach. 

A first example is given by the gatYZABCD operon 
(involved in galactitol metabolism). As expected 
the genes in this operon consistently cluster together. 
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FIG. 3: Sample interactions discussed in the main text. Blue 
edges are present in RegulondDB, green edges connect to 
genes not in RegulonDB, and dashed edges are new predic- 
tions validated by literature. 

GatR_2, a subunit of the GatR transcriptional repres- 
sor, is consistently assigned as a candidate regulator. 
It is known that GatR negatively controls transcrip- 
tion initiation of the gatYZABCD operon, which is the 
only known target of GatR according to RegulonDB. 
The many local optima, generated by the Gibbs sam- 
pling approach resulted in 81 additional genes which 
were clustered at least once with one of the gatYZABCD 
genes. Except for one potential new target treC which 
was found repeatedly in combination with gatC and 
gatD, the interactions between the remaining 80 genes 
and GatR were not sufficiently significant to be retained 
in the final solution. This illustrates how the ensem- 
ble approach guarantees an effective filtering of false 
positives while retaining the true positives. This exam- 
ple also illustrates how combinatorial regulation can be 
detected. For gatY, gatZ, gatA and gatB, a second reg- 
ulator, YeaT (a predicted regulator involved in malate 
metabolism), with an anticorrelated expression pattern 
seemed to play a role in a subset of the conditions (see 
Figure [2|. 

Four interactions could be inferred for GadX (a tran- 
scriptional activator involved in acid resistance) : GadX 
itself, which makes sense as it is known to be autoreg- 
ulated ||24|, and three novel targets sZp, gadW and yhiD. 
This finding is supported by literature, as the expression 
of both sip [25J and yhiD [25] seemed to be affected in a 
gadX mutant, while gadX and gadW seem to tightly con- 
trol each others expression [26]. 

For Lrp (a regulator involved in the high-affinity 
transport of branched-chain amino acids and a medi- 
ator of the leucine response) 44 interactions could be 
found of which 5 could be confirmed by RegulonDB 
{ilvl, ilvH, livG, livM and UvN) and 39 were new. Among 
the predicted interactions with Lrp we found the leu- 



LABCD genes. According to literature, the Lrp depen- 
dent regulation of the leuLABCD operon is only indirect 
1271 . However, without additional data, no distinction 
can be made between direct and indirect effects of a reg- 
ulator if both give rise to a correlated expression level 
with the targets. Four of our predicted targets (purM, 
argA, yhjE and aroP) were tested by ChIP analysis 1 8 1 and 
three {purM, argA and yhjE) were confirmed. YhjE was 
also found to be differentially expressed in a microarray 
analysis of lrp mutants [28 J. For the remaining genes 
no clear indication for their regulation by Lrp could be 
found. However, it should be noted that Lrp not only 
acts as a regulator, binding specific DNA-sequences, but 
also functions as a DNA-organizing protein, extending 
its global role in regulation [29. 30 J. A large regulon of 
Lrp, as was detected by our method, thus is in line with 
this more global role of Lrp. 



C. Model evaluation 

One of the purposes of model-based reverse- 
engineering methods is to infer a model of the system 
which extends beyond the data set used to learn the 
model. As such they can form the basis for develop- 
ing methods which use new data to refine and extend a 
partially validated model, rather than infer a completely 
new network model each time a data set is updated. Val- 
idation of a model is done by comparing the distribution 
of ^ log p{xi, . . . ,xi^) (see Methods) for new data with 
the distribution for data used to learn the model. In gen- 
eral, higher values of log p means better explanation of 
the data by the model. The result of this comparison for 
75 experiments recently added to the M^^ database |8l 
with the 189 original experiments, for 100 module net- 
work models selected at random from the ensemble, is 
shown in Figure E] The overlap between the distribu- 
tions shows that the probabilistic models indeed gener- 
alize to unseen data. For comparison we also perform a 
randomization test by permuting the gene indices of the 
data matrix. As expected, the evaluation of the mod- 
els on randomized data is several orders of magnitude 
smaller than on real data, and in fact around 15% of the 
randomized experiments have a value log p = — oo, i.e., 
zero likelihood. 



III. METHODS 
A. Data sets 

We downloaded a compendium of expression pro- 
files for E. coli [8] and a list of 328 candidate regula- 
tors from http : //gardnerlab . bu . edu/net inf er _plosT| 
2007/ We averaged over replicate experiments to ob- 
tain a data matrix of 4345 genes and 189 experiments. 
Additional data for 4292 genes and 75 experiments for 
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FIG. 4: Histogram of jjlo^p{xi, . . . ,Xj^) for 100 models in 
the ensemble, for the original data (red), new data (blue), and 
randomized data (green, around 15% zero likelihood values 
(log = — oo) not shown). 



validation of the probabilistic models was downloaded 
from jhttp : //in3d . bu . edu/norm/] 

B. Probabilistic model evaluation 

The probabilistic model introduced in [9\ associates 
to each gene i a continuous valued random variable X/ 
measuring the gene's expression level. The distribution 
of X/ depends on the expression level of a set of parent 
genes chosen from a list of candidate regulators. Genes 
regulated by the same parents form a cluster and share 
the same model parameters. The joint probability distri- 
bution for the expression levels of all genes decomposes 
as a product of conditional distributions, 

K 

P(X1,...,X]V) = n n Pki^i I i^r' r e 7^J) , (1) 

k=l ieAk 

where {^^, fc = 1, . . . , X} is the set of clusters (i.e., a par- 
tition of the gene set {1, ... , N}), and TZ]^ is the set of reg- 
ulators for cluster k. The distribution ([l| is normalized 
if the network from parents to children is acyclic. The 
conditional distribution of the expression level of the 
genes in cluster /c is a normal distribution with parame- 
ters determined by the expression levels of the parents 



Pk{x \{xr:r e 7^J) = p{x \ }i^,T^) . (2) 

The parameters and are determined by arranging 
the parents in a decision tree. The tests on the nodes 
of the decision tree are of the form 'Xy > zl' for some 
threshold value z, where Xr is the expression value of 
the parent r associated to the node. The leaves ^ of the 
decision tree are the sets of an experiment partition for 
cluster k and ji^ and are the mean, resp. precision of 
the expression levels of the genes in the cluster in this 
subset of experiments. See [14 J for more details. 

To evaluate a model of the form ([l]), we are only inter- 
ested in genes for which the model makes actual predici- 
tions, i.e., genes belonging to clusters with a regulation 
tree. If the clustering procedure does not find distinct 
experiment clusters for a certain gene cluster, the model 
predicts one broad normal distribution for the genes in 
this cluster. Any expression data for these genes will fit 
the model and thereby obscure the signal of the genes 
for which true predictions are made. For the particu- 
lar data used in Section [nc) the number of genes in the 
new data set is 4292 vs. 4345 in the data used to learn the 
models. Six of the missing genes belong to the regulator 
list for the learned models. Hence in some models we 
may not be able to compute the conditional probabil- 
ity distributions for all genes. Altogether around 2700 
genes remain for each model. 



C. Software availability 

The Java software package LeMoNe for learn- 
ing module networks is available from our website 
http: //bioinformatics .psb.ugent .be/software/ 
details/LeMoNel 
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